
Wine quality analysis
Introduction
I used the Wine Quality dataset available at the UC Irvine Machine Learning Repository. This dataset contains physicochemical and quality-related data about vinho verde, a unique wine produced in the northern region of Portugal. The data is divided into two subsets: red and white wine samples.
The main goal of my analysis is to investigate which features have the greatest influence on the wine's quality and to determine whether their impact is positive or negative.
Features
| Feature | Unit | Description |
|---|---|---|
| Fixed acidity | g (tartaric acid)/dm³ | The amount of non-volatile acids (mainly tartaric and malic acid) that do not evaporate during fermentation. |
| Volatile acidity | g (acetic acid)/dm³ | The amount of acetic acid present. High levels can cause an undesirable vinegar-like aroma and spoilage. |
| Citric acid | g/dm³ | Naturally occurring acid in small quantities. Enhances freshness in the wine. |
| Residual sugar | g/dm³ | The amount of sugar remaining after fermentation. Higher values result in a sweeter taste, while lower values indicate a drier wine. |
| Chlorides | g (sodium chloride)/dm³ | The concentration of sodium chloride (salt) in the wine. |
| Free sulfur dioxide | mg/dm³ | The portion of sulfur dioxide (SO₂) unbound in the wine, acting as a preservative against oxidation and microbial spoilage. |
| Total sulfur dioxide | mg/dm³ | The total amount of both free and bound sulfur dioxide. Excessive concentrations may negatively impact aroma and flavor. |
| Density | g/cm³ | The density of the wine. |
| pH | - | Measures the strength of acidity on the pH scale (0 = highly acidic, 14 = highly alkaline). |
| Sulphates | g (potassium sulphate)/dm³ | Additive used for its antioxidant and antimicrobial properties. May slightly increase bitterness and improve preservation. |
| Alcohol | % | The percentage of ethyl alcohol in the wine. |
The target variable is the wine quality, represented as a score on a scale from 0 to 10.
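One way to approach the stated goal is to rank the features by their correlation with quality. The sketch below is only an illustration under assumptions: `quality_correlations` is a hypothetical helper that does not appear elsewhere in this notebook, Spearman rank correlation is an assumption chosen because quality is an ordinal score, and the ten sample rows are copied from the red wine preview shown in this notebook, which is far too few to draw conclusions from.

```python
import pandas as pd


def quality_correlations(df, target="quality", method="spearman"):
    """Correlate every feature with the target, strongest absolute value first."""
    # DataFrame.corr returns the full correlation matrix; keep the target column only
    corr = df.corr(method=method)[target].drop(target)
    # Order by strength, but keep the sign so the direction stays visible
    strongest_first = corr.abs().sort_values(ascending=False).index
    return corr.loc[strongest_first]


# Ten rows copied from the red wine preview (illustration only, not the full dataset)
sample = pd.DataFrame({
    "volatile acidity": [0.700, 0.880, 0.760, 0.280, 0.700,
                         0.600, 0.550, 0.510, 0.645, 0.310],
    "alcohol": [9.4, 9.8, 9.8, 9.8, 9.4, 10.5, 11.2, 11.0, 10.2, 11.0],
    "quality": [5, 5, 5, 6, 5, 5, 6, 6, 5, 6],
})

result = quality_correlations(sample)
print(result)
```

On the real data the call would be `quality_correlations(df_red)` or `quality_correlations(df_white)`. A positive value means quality tends to rise with the feature, a negative value means it tends to fall; neither establishes causation.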
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import itertools
from ydata_profiling import ProfileReport
from pandas.plotting import parallel_coordinates
# Read in the red wine data to a dataframe
df_red = pd.read_csv("winequality-red.csv", sep=";")
# Read in the white wine data to a dataframe
df_white = pd.read_csv("winequality-white.csv", sep=";")
df_red
| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.700 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.99780 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.880 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.99680 | 3.20 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.760 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.99700 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.280 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.99800 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.700 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.99780 | 3.51 | 0.56 | 9.4 | 5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1594 | 6.2 | 0.600 | 0.08 | 2.0 | 0.090 | 32.0 | 44.0 | 0.99490 | 3.45 | 0.58 | 10.5 | 5 |
| 1595 | 5.9 | 0.550 | 0.10 | 2.2 | 0.062 | 39.0 | 51.0 | 0.99512 | 3.52 | 0.76 | 11.2 | 6 |
| 1596 | 6.3 | 0.510 | 0.13 | 2.3 | 0.076 | 29.0 | 40.0 | 0.99574 | 3.42 | 0.75 | 11.0 | 6 |
| 1597 | 5.9 | 0.645 | 0.12 | 2.0 | 0.075 | 32.0 | 44.0 | 0.99547 | 3.57 | 0.71 | 10.2 | 5 |
| 1598 | 6.0 | 0.310 | 0.47 | 3.6 | 0.067 | 18.0 | 42.0 | 0.99549 | 3.39 | 0.66 | 11.0 | 6 |
1599 rows × 12 columns
df_white
| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45.0 | 170.0 | 1.00100 | 3.00 | 0.45 | 8.8 | 6 |
| 1 | 6.3 | 0.30 | 0.34 | 1.6 | 0.049 | 14.0 | 132.0 | 0.99400 | 3.30 | 0.49 | 9.5 | 6 |
| 2 | 8.1 | 0.28 | 0.40 | 6.9 | 0.050 | 30.0 | 97.0 | 0.99510 | 3.26 | 0.44 | 10.1 | 6 |
| 3 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.99560 | 3.19 | 0.40 | 9.9 | 6 |
| 4 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.99560 | 3.19 | 0.40 | 9.9 | 6 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4893 | 6.2 | 0.21 | 0.29 | 1.6 | 0.039 | 24.0 | 92.0 | 0.99114 | 3.27 | 0.50 | 11.2 | 6 |
| 4894 | 6.6 | 0.32 | 0.36 | 8.0 | 0.047 | 57.0 | 168.0 | 0.99490 | 3.15 | 0.46 | 9.6 | 5 |
| 4895 | 6.5 | 0.24 | 0.19 | 1.2 | 0.041 | 30.0 | 111.0 | 0.99254 | 2.99 | 0.46 | 9.4 | 6 |
| 4896 | 5.5 | 0.29 | 0.30 | 1.1 | 0.022 | 20.0 | 110.0 | 0.98869 | 3.34 | 0.38 | 12.8 | 7 |
| 4897 | 6.0 | 0.21 | 0.38 | 0.8 | 0.020 | 22.0 | 98.0 | 0.98941 | 3.26 | 0.32 | 11.8 | 6 |
4898 rows × 12 columns
The dataset includes 1599 red wine samples and 4898 white wine samples. This difference is likely because vinho verde is more commonly produced as a white wine.
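The previews above contain identical rows (index 0 and 4 for red, index 3 and 4 for white), so a quick sanity check on duplicates and missing values may be worth running before going further. The sketch below assumes a hypothetical helper, `sanity_check`, and uses the first five red wine rows reduced to four columns. Whether such duplicates are data entry repeats or genuinely identical measurements is not something the data can tell us.

```python
import pandas as pd


def sanity_check(df):
    """Return basic counts that are worth knowing before any analysis."""
    return {
        "rows": len(df),
        # duplicated() flags every repeat of an earlier row; the first copy is not counted
        "duplicate_rows": int(df.duplicated().sum()),
        "missing_values": int(df.isna().sum().sum()),
    }


# First five red wine rows from the preview, reduced to four columns for brevity
head_red = pd.DataFrame({
    "fixed acidity": [7.4, 7.8, 7.8, 11.2, 7.4],
    "volatile acidity": [0.700, 0.880, 0.760, 0.280, 0.700],
    "alcohol": [9.4, 9.8, 9.8, 9.8, 9.4],
    "quality": [5, 5, 5, 6, 5],
})

report = sanity_check(head_red)
print(report)
```

On the full data, `sanity_check(df_red)` and `sanity_check(df_white)` would report the same counts across all twelve columns.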
White and red wine characteristicsΒΆ
The analysis covers both red and white wines, but the results may differ because the two types have distinct characteristics. The histograms below overlay the distribution of each feature for red and white wines so that they can be compared.
# List of features to compare
features = ['fixed acidity', 'volatile acidity', 'citric acid', 'density', 'residual sugar',
            'sulphates', 'alcohol', 'quality', 'chlorides', 'free sulfur dioxide',
            'total sulfur dioxide', 'pH']
# Set the number of rows and columns for the subplots (6 x 2 = one panel per feature)
n_rows = 6
n_cols = 2
# Create subplots
fig, axes = plt.subplots(n_rows, n_cols, figsize=(10, 20))
# Flatten the axes array for easy access
axes = axes.flatten()
# Plot each feature for both red and white wines on the same panel
for i, feature in enumerate(features):
    # Histogram for white wines
    axes[i].hist(df_white[feature], bins=20, color='#eac371', alpha=0.7, label='White wine')
    # Histogram for red wines, overlaid on the same axes
    axes[i].hist(df_red[feature], bins=20, color='#c7413c', alpha=0.7, label='Red wine')
    # Set labels and titles
    axes[i].set_xlabel(feature, fontsize=12)
    axes[i].set_ylabel('Frequency', fontsize=12)
    axes[i].set_title(f'Distribution of {feature} by wine type', fontsize=14)
    axes[i].legend()
# Adjust layout
plt.tight_layout()
# Show the plot
plt.show()
Observations:
- Red wines tend to have higher acidity, reflected in their higher levels of volatile acidity and fixed acidity.
- Red wines generally contain more sulphates, which may be related to the winemaking process, such as longer fermentation and aging periods.
- White wines typically have higher levels of residual sugar, consistent with white wines generally being sweeter and red wines drier.
- White wines contain more sulfur dioxide, which is used as a preservative.
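Because the white wine set is roughly three times larger than the red one, raw histogram counts can be hard to compare by eye, and a per-type median table is one way to back up observations like these with numbers. The sketch below assumes a hypothetical helper, `compare_medians`, and uses only the first five rows of each preview for three features, so the values illustrate the mechanics rather than the dataset.

```python
import pandas as pd


def compare_medians(red, white):
    """Put the per-feature medians of both wine types side by side."""
    # Building the frame from two Series aligns them on the feature names
    summary = pd.DataFrame({"red": red.median(), "white": white.median()})
    # Positive values mean the white wine median is higher
    summary["white_minus_red"] = summary["white"] - summary["red"]
    return summary


# First five rows of each preview table (illustration only)
red_head = pd.DataFrame({
    "residual sugar": [1.9, 2.6, 2.3, 1.9, 1.9],
    "total sulfur dioxide": [34.0, 67.0, 54.0, 60.0, 34.0],
    "sulphates": [0.56, 0.68, 0.65, 0.58, 0.56],
})
white_head = pd.DataFrame({
    "residual sugar": [20.7, 1.6, 6.9, 8.5, 8.5],
    "total sulfur dioxide": [170.0, 132.0, 97.0, 186.0, 186.0],
    "sulphates": [0.45, 0.49, 0.44, 0.40, 0.40],
})

comparison = compare_medians(red_head, white_head)
print(comparison)
```

With the full data loaded, `compare_medians(df_red, df_white)` would cover all twelve columns, including quality.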
# Run the profiling for red wine
profile = ProfileReport(df_red, title="Red wine")
profile